Semi-supervised Sequence Labeling for Named Entity Extraction based on Tri-Training: Case Study on Chinese Person Name Extraction
Authors
Abstract
Named entity extraction is a fundamental task for many knowledge engineering applications. Existing studies rely on annotated training data, which is expensive to obtain in large quantities and therefore limits recognition effectiveness. In this research, we propose an automatic labeling procedure that prepares training data from structured resources containing known named entities. Because this automatically labeled training data may contain noise, a self-testing procedure can be applied as a follow-up to remove low-confidence annotations and improve extraction performance with less training data. In addition to preparing labeled training data, we employ semi-supervised learning to exploit large amounts of unlabeled data. By modifying tri-training for sequence labeling and deriving a proper initialization, we further improve entity extraction. On the task of Chinese person name extraction with 364,685 sentences (8,672 news articles) and 54,449 (11,856 distinct) person names, an F-measure of 90.4% is achieved.
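As a rough illustration of two ideas in the abstract, the sketch below shows (1) automatic labeling of a sentence from a list of known names and (2) the classic tri-training agreement rule, applied here at the level of whole tag sequences. The abstract does not describe the authors' actual labeling procedure or their modification of tri-training, so everything here is an assumption: the function names `auto_label` and `agreed_examples`, the character-level BIO tag scheme, the longest-match lookup, and the whole-sequence agreement test are all hypothetical.

```python
from typing import Callable, List, Sequence, Set, Tuple

# A tagger maps a sentence to one tag per character (hypothetical interface).
Tagger = Callable[[str], List[str]]


def auto_label(sentence: str, known_names: Set[str]) -> List[str]:
    """Assign character-level BIO tags by longest-match lookup of known names.

    This is an assumed stand-in for labeling from a structured resource;
    it will produce noisy labels (e.g. names missing from the list stay "O").
    """
    tags = ["O"] * len(sentence)
    max_len = max((len(name) for name in known_names), default=0)
    i = 0
    while i < len(sentence):
        matched = 0
        # Try the longest candidate span first, then shrink.
        for length in range(min(max_len, len(sentence) - i), 0, -1):
            if sentence[i:i + length] in known_names:
                matched = length
                break
        if matched:
            tags[i] = "B-PER"
            for j in range(i + 1, i + matched):
                tags[j] = "I-PER"
            i += matched
        else:
            i += 1
    return tags


def agreed_examples(
    tagger_a: Tagger, tagger_b: Tagger, unlabeled: Sequence[str]
) -> List[Tuple[str, List[str]]]:
    """Collect sentences on which two taggers output identical tag sequences.

    In classic tri-training, examples that two classifiers agree on are
    added to the training set of the third. Requiring the entire sequence
    to match is an assumption made for this sketch.
    """
    selected = []
    for sentence in unlabeled:
        tags_a = tagger_a(sentence)
        if tags_a == tagger_b(sentence):
            selected.append((sentence, tags_a))
    return selected
```

In a full tri-training loop, `agreed_examples` would be called once per round for each of the three pairings of taggers, and the third tagger would be retrained on its original data plus the selected sentences.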
Similar resources
Self-Adjustable BootStrapping for Web-Scale Named Entity Extraction using N-grams
Named Entity Extraction refers to the task of identifying and extracting mentions of names such as person names, locations, time expressions, and monetary values from text. There are different approaches to Named Entity extraction and classification based on supervised and semi-supervised learning. This paper describes a bootstrapping approach to extracting Named Entities for 150 categories from Wikip...
Scientific Information Extraction with Semi-supervised Neural Tagging
This paper addresses the problem of extracting keyphrases from scientific articles and categorizing them as corresponding to a task, process, or material. We cast the problem as sequence tagging and introduce semi-supervised methods to a neural tagging model, which builds on recent advances in named entity recognition. Since annotated training data is scarce in this domain, we introduce a graph...
A Novel Approach to Conditional Random Field-based Named Entity Recognition using Persian Specific Features
Named Entity Recognition is an information extraction technique that identifies named entities in a text. Three kinds of methods have conventionally been used to extract named entities from text: rule-based, machine-learning-based, and hybrid. Machine-learning-based methods perform well on the Persian language if they are trained with good features. To get good performanc...
Data Analysis Project: Semi-Supervised Discovery of Named Entities and Relations from the Web
This project studies semi-supervised discovery of named entities, relational entities and prepositional phrase attachments within a read-the-web framework. Meanings of an entity can be improvised and updated faster in the internet world than in printed references. The main idea of this project is to study the feasibility of characterizing entities by web content directly. The approach is that cont...
Semi-supervised Statistical Inference for Business Entities Extraction and Business Relations Discovery
The sheer volume of user-contributed data on the Internet has motivated organizations to explore collective business intelligence (BI) for improving business decision making. One common problem for BI extraction is to accurately identify the entities being referred to in user-contributed comments. Although named entity recognition (NER) tools are available to identify basic entities in tex...